Abstract: Efforts in this paper seek to combine graph theory 
with adaptive dynamic programming (ADP) as a reinforcement 
learning (RL) framework to determine forward-in-time, real- 
time, approximate optimal controllers for distributed multi-agent 
systems with uncertain nonlinear dynamics. A decentralized 
continuous time-varying control strategy is proposed, using only 
local communication feedback from two-hop neighbors on a 
communication topology that has a spanning tree. An actor- 
critic-identifier architecture is proposed that employs a nonlinear 
state derivative estimator to estimate the unknown dynamics 
online and uses the estimate thus obtained for value function 
approximation. Simulation results demonstrate the applicability 
of the proposed technique to cooperatively control a group of 
five agents. 

I. Introduction 

Combined efforts from multiple autonomous agents can 
yield tactical advantages including: improved munitions ef- 
fects; distributed sensing, detection, and threat response; and 
distributed communication pipelines. While coordinating be- 
haviors among autonomous agents is a challenging prob- 
lem that has received mainstream focus, unique challenges 
arise when seeking autonomous collaborative behaviors in 
low bandwidth communication environments. For example, 
most collaborative control literature focuses on centralized 
approaches that require all nodes to continuously communicate 
with a central agent, yielding a heavy communication demand 
that is subject to failure due to delays and missing informa- 
tion. Furthermore, the central agent must carry enough 
computational resources on-board to process the data and to 
generate command signals. These challenges motivate the need 
for a decentralized approach where the nodes only need to 
communicate with their neighbors for guidance, navigation 
and control tasks. 

Reinforcement learning (RL) allows an agent to learn the 
optimal policy by interacting with its environment, and hence, 
is useful for control synthesis in complex dynamical systems 
such as a network of agents. Decentralized algorithms have 
been developed for cooperative control of networks of agents 
with finite state and action spaces in [1]-[4]. See [4] for 
a survey. The extension of these techniques to networks of 
agents with infinite state and action spaces and nonlinear 
dynamics is challenging due to difficulties in value function 
approximation, and has remained an open problem. 

As the desired action by an individual agent depends on the 
actions and the resulting trajectories of its neighbors, the error 
system for each agent becomes a complex nonautonomous 
dynamical system. Nonautonomous systems, in general, have 
non-stationary value functions. As non-stationary functions are 
difficult to approximate using parametrized function approx- 
imation schemes such as neural networks (NNs), designing 
optimal policies for nonautonomous systems is not trivial. To 
get around this challenge, differential game theory is often 
employed in multi-agent optimal control, where a solution to 
the coupled Hamilton-Jacobi-Bellman (HJB) equation (cf. [5]) 
is sought. As the coupled HJB equations are difficult to solve, 
some form of generalized policy iteration or value iteration [6] 
is often employed to get an approximate solution. It is shown 
in results such as [5], [7]-[11] that approximate dynamic 
programming (ADP) can be used to generate approximate 
optimal policies online for multi-agent systems. As the HJB 
equations to be solved are coupled, all of these results have a 
centralized control architecture. 

Decentralized control techniques focus on finding control 
policies based on local data for individual agents that col- 
lectively achieve the desired goal, which, for the problem 
considered in this effort, is consensus to the origin. Various 
methods have been developed to solve the consensus problem 
for linear systems with exact model knowledge. An optimal 
control approach is used in [12] to achieve consensus while 
avoiding obstacles. In [13], an optimal controller is developed 
for agents with known dynamics to cooperatively track a 
desired trajectory. In [14], an optimal consensus algorithm 
is developed for a cooperative team of agents with linear 
dynamics using only partial information. A value function 
approximation based approach is presented in [15] for co- 
operative synchronization in a strongly connected network 
of agents with known linear dynamics. It is also shown in 
[15] that the obtained policies are in a cooperative Nash 
equilibrium. 

For nonlinear systems, a model predictive control approach 
is presented in [16]; however, no stability or convergence 
analysis is presented. A stable distributed model predictive 
controller is presented in [17] for nonlinear discrete-time 
systems with known nominal dynamics. Asymptotic stability is 
proved without any interaction between the nodes; however, a 
nonlinear optimal control problem needs to be solved at every 
iteration to implement the controller. Decentralized optimal 
control synthesis for consensus in a topological network of 
agents with continuous-time uncertain nonlinear dynamics has 
remained an open problem. 

In this result, an ADP-based approach is developed to 
solve the consensus problem for a network topology that has 
a spanning tree. The agents are assumed to have nonlinear 
control-affine dynamics with unknown drift vectors and known 
control effectiveness matrices. An identifier is used in con- 
junction with the controller enabling the algorithm to find 
approximate optimal decentralized policies online without the 
knowledge of drift dynamics. This effort thus realizes the 
actor-critic-identifier (ACI) architecture (cf. [18], [19]) for 
networks of agents. Simulations are presented to demonstrate 
the applicability of the proposed technique to cooperatively 
control a group of five agents. 



II. Graph Theory Preliminaries 

Let $\mathcal{N} = \{\beta_1, \beta_2, \ldots, \beta_N\}$ denote a set of $N$ agents 
moving in the state space $S \subset \mathbb{R}^n$. The objective is for the 
agents to reach a consensus state. Without loss of generality, 
let the consensus state be the origin of the state space, i.e., 
$S \ni x_0 = 0$. To aid the subsequent design, the agent 
$\beta_0$ (henceforth referred to as the leader) is assumed to be 
stationary at the origin. The agents are assumed to be on a 
network with a fixed communication topology modeled as a 
static directed graph (i.e., digraph). 

Each agent forms a node in the digraph. If agent $\beta_i$ can receive information from agent $\beta_j$, then there exists a directed edge from the $j$-th to the $i$-th node of the digraph, denoted by the ordered pair $(\beta_j, \beta_i) \in \mathcal{N} \times \mathcal{N}$. Let $E \subset \mathcal{N} \times \mathcal{N}$ denote the set of all edges. Let there be a positive weight $a_{ij} \in \mathbb{R}$ associated with each edge $(\beta_j, \beta_i)$. Note that $a_{ij} \neq 0$ if and only if $(\beta_j, \beta_i) \in E$. The digraph is assumed to have no self-loops, i.e., $(\beta_i, \beta_i) \notin E, \forall i$, which implies $a_{ii} = 0, \forall i$. Note that $a_{i0}$ denotes the edge weight (also referred to as the pinning gain) for the edge between the leader and an agent $\beta_i$. Similar to the other edge weights, $a_{i0} \neq 0$ if and only if there exists a directed edge from the leader to the agent $i$. The neighborhood set of agent $\beta_i$ is denoted by $\mathcal{N}_i$, defined as $\mathcal{N}_i \triangleq \{j \mid (\beta_j, \beta_i) \in E\}$. To streamline the analysis, the graph connectivity matrix $A \in \mathbb{R}^{N \times N}$ is defined as $A \triangleq [a_{ij} \mid i, j = 1, \ldots, N]$, the pinning gain matrix $A_0 \in \mathbb{R}^{N \times N}$ is a diagonal matrix defined as $A_0 \triangleq \mathrm{diag}(a_{i0}) \mid i = 1, \ldots, N$, the matrix $D \in \mathbb{R}^{N \times N}$ is defined as $D \triangleq \mathrm{diag}(d_i)$, where $d_i \triangleq \sum_{j \in \mathcal{N}_i} a_{ij}$, and the graph Laplacian matrix $\mathcal{L} \in \mathbb{R}^{N \times N}$ is defined as $\mathcal{L} \triangleq D - A$. The graph is said to have a spanning tree if, given any node $\beta_i$, there exists a directed path from the leader $\beta_0$ to $\beta_i$. For notational brevity, a linear operator $T_i((\cdot))$ is defined as

$$T_i((\cdot)) \triangleq \sum_{j \in \mathcal{N}_i} a_{ij}\big((\cdot)_i - (\cdot)_j\big) + a_{i0}(\cdot)_i. \tag{1}$$
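The graph quantities defined in this section can be assembled numerically. The following is a minimal sketch, assuming a hypothetical five-agent topology with unit weights (the edge list below is an illustrative assumption, not the topology of Figure 1); the helper names `graph_matrices`, `has_spanning_tree`, and `T` are introduced here for illustration only.

```python
import numpy as np

def graph_matrices(a, a0):
    """Return (A, A0, D, L) from edge weights a[i][j] and pinning gains a0[i]."""
    A = np.asarray(a, dtype=float)
    A0 = np.diag(np.asarray(a0, dtype=float))
    D = np.diag(A.sum(axis=1))          # d_i = sum over j of a_ij
    L = D - A                           # graph Laplacian
    return A, A0, D, L

def has_spanning_tree(A, A0):
    """True if every agent is reachable from the leader along directed edges."""
    n = A.shape[0]
    reached = {int(i) for i in np.flatnonzero(np.diag(A0) > 0)}  # pinned agents
    frontier = list(reached)
    while frontier:
        j = frontier.pop()
        # a_ij != 0 encodes an edge from node j to node i
        for i in np.flatnonzero(A[:, j] > 0):
            if int(i) not in reached:
                reached.add(int(i))
                frontier.append(int(i))
    return len(reached) == n

def T(i, A, A0, z):
    """Operator (1) applied to the rows of z (one row per agent)."""
    n_agents = A.shape[0]
    return sum(A[i, j] * (z[i] - z[j]) for j in range(n_agents)) + A0[i, i] * z[i]

# Hypothetical topology (assumption): agents 1 and 2 are pinned to the leader,
# agents 1 and 3 hear each other, agent 4 hears agent 1, agent 5 hears agent 2.
a = np.zeros((5, 5))
a[0, 2] = a[2, 0] = a[3, 0] = a[4, 1] = 1.0
a0 = [1.0, 1.0, 0.0, 0.0, 0.0]
A, A0, D, L = graph_matrices(a, a0)
z = np.arange(10.0).reshape(5, 2)       # arbitrary per-agent vectors
```

In this sketch the rows of `L` sum to zero by construction, and removing all pinning gains makes `has_spanning_tree` return `False`, since no agent is then reachable from the leader.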

III. Problem Definition 
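Let the dynamics of each agent be described as

$$\dot{x}_i = f_i(x_i) + g_i(x_i) u_i, \quad \forall i = 1, 2, \ldots, N,$$

where $x_i(\cdot) \in S \subset \mathbb{R}^n$ is the state, $f_i : S \to \mathbb{R}^n$ and $g_i : S \to \mathbb{R}^{n \times m}$ are locally Lipschitz functions, and $u_i(\cdot) \in \mathbb{R}^m$ is the control policy. To achieve consensus to the leader, define the local neighborhood tracking error $e_i(\cdot) \in S \subset \mathbb{R}^n$ for each agent as [20]

$$e_i \triangleq T_i(x) = \sum_{j \in \mathcal{N}_i} a_{ij}(x_i - x_j) + a_{i0} x_i. \tag{2}$$

Denote the cardinality of the set $\mathcal{N}_i$ by $|\mathcal{N}_i|$. Let $\mathcal{E}_i(\cdot) \in S^{|\mathcal{N}_i|+1} \subset \mathbb{R}^{n(|\mathcal{N}_i|+1)}$ be a stacked vector of local neighborhood tracking errors corresponding to the agent $\beta_i$ and its neighbors, i.e., $\mathcal{E}_i = \{e_j \mid j \in \mathcal{N}_i\} \cup \{e_i\}$. To achieve consensus in an optimal cooperative way, it is desired to minimize, for each agent, the cost $J_i \triangleq \int_0^\infty r_i(\mathcal{E}_i, u_i)\, dt$, where

$$r_i(\mathcal{E}_i, u_i) = e_i^T Q_{ii} e_i + u_i^T R_i u_i + \sum_{j \in \mathcal{N}_i} a_{ij} e_j^T Q_{ij} e_j. \tag{3}$$

In (3), $R_i \in \mathbb{R}^{m \times m}$ and $Q_{ii}, Q_{ij} \in \mathbb{R}^{n \times n}$ are symmetric positive definite matrices of constants. Let $\mathcal{E} \triangleq [e_1^T\ e_2^T\ \cdots\ e_N^T]^T \in S^N$ and $X \triangleq [x_1^T\ x_2^T\ \cdots\ x_N^T]^T \in S^N$. Using the definition of $e_i$ from (2) we get

$$\mathcal{E} = \begin{bmatrix} T_1(x) \\ \vdots \\ T_N(x) \end{bmatrix} = \big((\mathcal{L} + A_0) \otimes I_n\big) X,$$

where $\otimes$ denotes the Kronecker product and $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix.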
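The stacked-error identity and the local cost in (3) can be checked numerically. The sketch below assumes a hypothetical three-agent chain (leader to agent 1 to agent 2 to agent 3) with arbitrary states; for simplicity it also assumes the same matrix `Q_ij` for every neighbor. None of these values come from the paper.

```python
import numpy as np

# Hypothetical chain graph (assumption): row i lists the agents that agent i hears.
A = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
A0 = np.diag([1.0, 0.0, 0.0])           # only the first agent is pinned
L = np.diag(A.sum(axis=1)) - A
n = 2                                    # state dimension of each agent
x = np.array([[1.0, -2.0],
              [0.5, 0.0],
              [-1.0, 3.0]])              # row i holds x_i

def local_error(i):
    """Local neighborhood tracking error e_i from (2)."""
    return sum(A[i, j] * (x[i] - x[j]) for j in range(3)) + A0[i, i] * x[i]

# Stacked error computed agent by agent and through the Kronecker product.
E_local = np.concatenate([local_error(i) for i in range(3)])
E_kron = np.kron(L + A0, np.eye(n)) @ x.reshape(-1)

def local_cost(i, u_i, Q_ii, Q_ij, R_i):
    """Local cost r_i from (3), using one Q_ij for all neighbors (assumption)."""
    e = [local_error(k) for k in range(3)]
    r = e[i] @ Q_ii @ e[i] + u_i @ R_i @ u_i
    r += sum(A[i, j] * (e[j] @ Q_ij @ e[j]) for j in range(3))
    return float(r)

r_2 = local_cost(1, np.array([0.5]), np.eye(2), 0.5 * np.eye(2), np.eye(1))
```

Both routes give the same stacked vector, which is the content of the identity: each agent can form its own block of $\mathcal{E}$ from neighbor states alone.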

IV. Control Development 
A. State derivative estimation 






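Based on the development in [19], each agent's dynamics can be approximated using a dynamic neural network (DNN) with $L_{fi}$ hidden layer neurons as

$$\dot{x}_i = W_{fi}^T \sigma(V_{fi}^T x_i) + \epsilon_{fi}(x_i) + g_i(x_i) u_i,$$

where $W_{fi} \in \mathbb{R}^{(L_{fi}+1) \times n}$ and $V_{fi} \in \mathbb{R}^{n \times L_{fi}}$ are unknown ideal DNN weights, $\sigma_{fi} \triangleq \sigma(V_{fi}^T x_i) \in \mathbb{R}^{L_{fi}+1}$ is a bounded DNN activation function, and $\epsilon_{fi} : \mathbb{R}^n \to \mathbb{R}^n$ is the function reconstruction error. In the following, the drift dynamics $f_i$ are unknown and the control effectiveness functions $g_i$ are assumed to be known. Each agent estimates the derivative of its own state using the following state-derivative estimator

$$\begin{aligned}
\dot{\hat{x}}_i &= \hat{f}_i + g_i(x_i) u_i + \mu_i, \qquad \hat{f}_i = \hat{W}_{fi}^T \hat{\sigma}_{fi}, \\
\dot{\hat{W}}_{fi} &= \mathrm{proj}\big(\Gamma_{wfi}\, \hat{\sigma}'_{fi} \hat{V}_{fi}^T \dot{\hat{x}}_i \tilde{x}_i^T\big), \\
\dot{\hat{V}}_{fi} &= \mathrm{proj}\big(\Gamma_{vfi}\, \dot{\hat{x}}_i \tilde{x}_i^T \hat{W}_{fi}^T \hat{\sigma}'_{fi}\big), \\
\mu_i &= k_{fi} \tilde{x}_i(t) - k_{fi} \tilde{x}_i(0) + v_i, \\
\dot{v}_i &= (k_{fi} \alpha_{fi} + \gamma_{fi}) \tilde{x}_i + \beta_{1fi}\, \mathrm{sgn}(\tilde{x}_i), \qquad v_i(0) = 0,
\end{aligned} \tag{4}$$

where $\hat{W}_{fi}(\cdot) \in \mathbb{R}^{(L_{fi}+1) \times n}$ and $\hat{V}_{fi}(\cdot) \in \mathbb{R}^{n \times L_{fi}}$ are the estimates for the ideal DNN weights $W_{fi}$ and $V_{fi}$, $\hat{x}_i(\cdot) \in \mathbb{R}^n$ is the state estimate, $\hat{\sigma}_{fi} \triangleq \sigma(\hat{V}_{fi}^T \hat{x}_i) \in \mathbb{R}^{L_{fi}+1}$, $\tilde{x}_i \triangleq x_i - \hat{x}_i \in \mathbb{R}^n$ is the state estimation error, $k_{fi}, \alpha_{fi}, \gamma_{fi}, \beta_{1fi} \in \mathbb{R}$ are positive constant control gains, $\mathrm{proj}\{\cdot\}$ is a smooth projection operator [21], and $v_i(\cdot) \in \mathbb{R}^n$ is a generalized Filippov solution to (4). For notational brevity define

$$\hat{F}_i(x_i, \hat{x}_i, u_i, t) \triangleq \hat{f}_i(\hat{x}_i) + g_i(x_i) u_i + \mu_i(t),$$
$$F_i(x_i, u_i) \triangleq f_i(x_i) + g_i(x_i) u_i, \qquad \tilde{F}_i \triangleq F_i - \hat{F}_i.$$

It is shown in [19, Theorem 1] that provided the gains $k_{fi}$ and $\gamma_{fi}$ are sufficiently large and $x_i$ and $u_i$ are bounded, the estimation error $\tilde{x}_i$ and its derivative are bounded. Furthermore, $\lim_{t \to \infty} \|\tilde{x}_i(t)\| = 0$, $\lim_{t \to \infty} \|\dot{\tilde{x}}_i(t)\| = 0$, and $\tilde{F}_i \in \mathcal{L}_\infty$.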



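B. Value function approximation

The value function $V_i : S^{|\mathcal{N}_i|+1} \to \mathbb{R}$ is the cost-to-go for each agent given by

$$V_i(\mathcal{E}_i^0) = \int_{t_0}^{\infty} r_i\big(\mathcal{E}_i(\tau), u_i(\mathcal{E}_i(\tau))\big)\, d\tau, \tag{5}$$

where $\mathcal{E}_i(\tau)$ denotes the neighborhood tracking error trajectories associated with agent $\beta_i$ and its neighbors, with the initial conditions $\mathcal{E}_i(t_0) = \mathcal{E}_i^0$. The time derivative of $V_i$ is then given by

$$\dot{V}_i = \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial V_i}{\partial e_j} T_j(F),$$

where $T_j(F) = \sum_{k \in \mathcal{N}_j} a_{jk}(F_j - F_k) + a_{j0} F_j$ is the time derivative of $e_j$. The Hamiltonian for the optimal control problem is the differential equivalent of (5) given by

$$H_i \triangleq r_i(\mathcal{E}_i, u_i) + \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial V_i}{\partial e_j} T_j(F).$$

The optimal value function $V_i^* : S^{|\mathcal{N}_i|+1} \to \mathbb{R}$ is defined as

$$V_i^*(\mathcal{E}_i^0) = \min_{u_i \in U_i} \int_{t_0}^{\infty} r_i\big(\mathcal{E}_i(\tau), u_i(\mathcal{E}_i(\tau))\big)\, d\tau, \tag{6}$$

where $U_i$ denotes the set of all admissible policies for the agent $\beta_i$ [22]. Assuming that the minimizer in (6) exists, $V_i^*$ is the solution to the HJB equation

$$H_i^* \triangleq r_i(\mathcal{E}_i, u_i^*) + \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial V_i^*}{\partial e_j} T_j(F^*) = 0, \tag{7}$$

where $F_i^*(x_i, u_i^*) = f_i(x_i) + g_i(x_i) u_i^*$, and the minimizer in (6) is the optimal policy $u_i^* : S^{|\mathcal{N}_i|+1} \to \mathbb{R}^m$, which can be obtained by solving the equation $\frac{\partial H_i^*}{\partial u_i^*} = 0$. Using the definition of $H_i^*$ in (7), the optimal policy can be written in a closed form as

$$u_i^* = -\frac{1}{2} R_i^{-1} g_i^T \Big( (a_{i0} + d_i) \big(V^{*\prime}_{e_i}\big)^T - \sum_{j \in \mathcal{N}_i} a_{ji} \big(V^{*\prime}_{e_j}\big)^T \Big), \tag{8}$$

where $V^{*\prime}_{e_i} \triangleq \frac{\partial V_i^*}{\partial e_i}$ and $V^{*\prime}_{e_j} \triangleq \frac{\partial V_i^*}{\partial e_j}$, assuming that the optimal value function $V_i^*$ satisfies $V_i^* \in C^1$ and $V_i^*(0) = 0$. Note that the controller for node $i$ only requires the tracking error and edge weight information from itself and its neighbors. The following assumptions are made to facilitate the use of NNs to approximate the optimal policy and the optimal value function.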
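The closed-form policy in (8) is a direct evaluation once the gradients of the value function are available. The sketch below assumes a hypothetical two-agent graph and a hypothetical quadratic value function, so the gradient vectors are made-up numbers; `optimal_policy` is an illustrative helper name, not part of the paper.

```python
import numpy as np

def optimal_policy(i, A, a0, R_i, g_i, grad_V):
    """Evaluate (8): grad_V[j] is the gradient of V_i with respect to e_j."""
    d_i = A[i].sum()
    neighbors = [int(j) for j in np.flatnonzero(A[i] > 0)]
    # (a_i0 + d_i) * dV/de_i  minus  sum over neighbors of a_ji * dV/de_j
    s = (a0[i] + d_i) * grad_V[i] - sum(A[j, i] * grad_V[j] for j in neighbors)
    return -0.5 * np.linalg.solve(R_i, g_i.T @ s)

# Hypothetical data (assumptions): two agents that hear each other,
# agent 0 pinned to the leader, n = 2 states, m = 1 input.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
a0 = np.array([1.0, 0.0])
g = np.array([[0.0],
              [1.0]])
R = np.array([[1.0]])
e = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
# Gradients of an assumed V_0 = e_0^T e_0 + 0.5 * e_1^T e_1.
grad_V = [2.0 * e[0], 1.0 * e[1]]
u = optimal_policy(0, A, a0, R, g, grad_V)
```

Only the agent's own error, its neighbors' errors, and the edge weights enter the computation, which is the locality property noted above.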

Assumption 1. The set S is compact. Based on the subsequent 
stability analysis, this assumption holds as long as the initial 
condition $X(0)$ is bounded. See Remark 1 in the subsequent 
stability analysis. 

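Assumption 2. Each optimal value function $V_i^*$ can be represented using a NN with $M_i$ neurons as

$$V_i^*(\mathcal{E}_i) = W_i^T \sigma_i(\mathcal{E}_i) + \epsilon_i(\mathcal{E}_i), \tag{9}$$

where $W_i \in \mathbb{R}^{M_i}$ is the ideal weight matrix bounded above by a known positive constant $\bar{W}_i \in \mathbb{R}$ in the sense that $\|W_i\|_2 \le \bar{W}_i$, $\sigma_i : S^{|\mathcal{N}_i|+1} \to \mathbb{R}^{M_i}$ is a bounded continuously differentiable nonlinear activation function, and $\epsilon_i : S^{|\mathcal{N}_i|+1} \to \mathbb{R}$ is the function reconstruction error such that $\sup_{\mathcal{E}_i} |\epsilon_i(\mathcal{E}_i)| \le \bar{\epsilon}_i$ and $\sup_{\mathcal{E}_i} \|\epsilon_i'(\mathcal{E}_i)\| \le \bar{\epsilon}_i'$, where $\epsilon_i' \triangleq \frac{\partial \epsilon_i}{\partial \mathcal{E}_i}$ and $\bar{\epsilon}_i, \bar{\epsilon}_i' \in \mathbb{R}$ are positive constants [23], [24].

From (8) and (9) the optimal policy can be represented as

$$u_i^* = -\frac{1}{2} R_i^{-1} g_i^T \big(L_{\sigma i} W_i + L_{\epsilon i}\big), \tag{10}$$

where $L_{\sigma i} \triangleq (a_{i0} + d_i)\Big(\frac{\partial \sigma_i}{\partial e_i}\Big)^T - \sum_{j \in \mathcal{N}_i} a_{ji} \Big(\frac{\partial \sigma_i}{\partial e_j}\Big)^T$ and $L_{\epsilon i} \triangleq (a_{i0} + d_i)\Big(\frac{\partial \epsilon_i}{\partial e_i}\Big)^T - \sum_{j \in \mathcal{N}_i} a_{ji} \Big(\frac{\partial \epsilon_i}{\partial e_j}\Big)^T$.

Based on (9) and (10), the NN approximations to the optimal value function and the optimal policy are given by

$$\hat{V}_i = \hat{W}_{ci}^T \sigma_i, \qquad u_i = -\frac{1}{2} R_i^{-1} g_i^T L_{\sigma i} \hat{W}_{ai}, \tag{11}$$

where $\hat{W}_{ci}(\cdot) \in \mathbb{R}^{M_i}$ and $\hat{W}_{ai}(\cdot) \in \mathbb{R}^{M_i}$ are estimates of the ideal neural network weights $W_i$. Using (9)-(11), the approximate Hamiltonian $\hat{H}_i(\cdot)$ and the optimal Hamiltonian $H_i^*(\cdot)$ can be obtained as

$$\begin{aligned}
\hat{H}_i &= e_i^T Q_{ii} e_i + u_i^T R_i u_i + \sum_{j \in \mathcal{N}_i} a_{ij} e_j^T Q_{ij} e_j + \hat{W}_{ci}^T \omega_i, \\
H_i^* &= e_i^T Q_{ii} e_i + u_i^{*T} R_i u_i^* + \sum_{j \in \mathcal{N}_i} a_{ij} e_j^T Q_{ij} e_j + W_i^T \omega_i^* + \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial \epsilon_i}{\partial e_j} T_j(F^*),
\end{aligned} \tag{12}$$

where

$$\omega_i \triangleq \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial \sigma_i}{\partial e_j} T_j(\hat{F}) \quad \text{and} \quad \omega_i^* \triangleq \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial \sigma_i}{\partial e_j} T_j(F^*). \tag{13}$$

Using (7), the error between the approximate and the optimal Hamiltonian, called the Bellman error (BE) $\delta_i(\cdot) \in \mathbb{R}$, is given in a measurable form by

$$\delta_i \triangleq \hat{H}_i - H_i^* = \hat{H}_i. \tag{14}$$

Note that equations (12)-(14) imply that to compute the BE, the $i$-th agent requires the knowledge of $T_i(\hat{F})$ and $T_j(\hat{F})$ for all $j \in \mathcal{N}_i$. As each agent can compute its own $T_i(\hat{F})$ based on local information, the computation of $\delta_i$ for each agent can be achieved via two-hop local communication.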
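The measurable form of the BE in (12)-(14) can be sketched as follows. The two-agent graph, the three-element basis, the derivative estimates `F_hat`, and all numerical values are hypothetical; in a distributed implementation each `T(j, ...)` term would be computed by agent `j` and passed to agent `i`, whereas this sketch evaluates everything in one place for brevity.

```python
import numpy as np

def T(j, A, a0, z):
    """Operator (1) applied to the rows of z (one row per agent)."""
    return sum(A[j, k] * (z[j] - z[k]) for k in range(len(z))) + a0[j] * z[j]

def bellman_error(i, A, a0, e, u_i, F_hat, W_c, dsigma, Q_ii, Q_ij, R_i):
    """Return (delta_i, omega_i) following (12)-(14).

    dsigma[j] is the Jacobian of sigma_i with respect to e_j (M_i x n).
    F_hat[k] is agent k's estimate of its own state derivative.
    """
    hood = [i] + [int(j) for j in np.flatnonzero(A[i] > 0)]
    omega = sum(dsigma[j] @ T(j, A, a0, F_hat) for j in hood)
    r = e[i] @ Q_ii @ e[i] + u_i @ R_i @ u_i
    r += sum(A[i, j] * (e[j] @ Q_ij @ e[j]) for j in hood[1:])
    return float(r + W_c @ omega), omega

# Hypothetical data (assumptions): agent 1 hears agent 0, agent 0 is pinned.
A = np.array([[0.0, 0.0],
              [1.0, 0.0]])
a0 = np.array([1.0, 0.0])
e = np.array([[1.0, 0.0],
              [0.5, -1.0]])
F_hat = np.array([[-1.0, 0.0],
                  [0.0, 1.0]])
# Assumed basis sigma_1 = [e_11^2, e_12^2, e_01^2] and its Jacobians.
dsigma = {1: np.array([[2 * e[1, 0], 0.0], [0.0, 2 * e[1, 1]], [0.0, 0.0]]),
          0: np.array([[0.0, 0.0], [0.0, 0.0], [2 * e[0, 0], 0.0]])}
delta, omega = bellman_error(1, A, a0, e, np.array([0.5]), F_hat, np.ones(3),
                             dsigma, np.eye(2), 0.5 * np.eye(2), np.eye(1))
```

Note that `T(0, ...)` depends on the neighbors of agent 0, which is where the two-hop communication requirement arises.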

The primary contribution of this result is that the developed 
value function approximation scheme, together with the state 
derivative estimator, enables the computation of the BE $\delta_i$ 
with only local information, and without the knowledge of 
drift dynamics. Furthermore, unlike the previous results such 
as [13], the effect of the local tracking errors of the neighbors 
of an agent is explicitly considered in the HJB equation for 
that agent, resulting in the novel control law in (11). In the 
following, the update laws for the value function and the 
policy weight estimates based on the BE are presented. The 
update laws and the subsequent development leading up to 
the stability analysis in Section V are similar to our previous 
result in [19] with minor changes, and are presented here for 
completeness. 

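Note that the BE in (14) is linear in the value function weight estimates $\hat{W}_{ci}$ and nonlinear in the policy weight estimates $\hat{W}_{ai}$. The use of two different sets of weights to approximate the same ideal weights $W_i$ is motivated by the heuristic observation that adaptive update laws based on least squares minimization perform better than those based on gradient descent. As the application of the least squares technique requires linearity of the error with respect to the parameters being estimated, the use of two different sets of weights facilitates the development of a least squares minimization-based update law for the value function weights. The value function weights are updated to minimize $\int_{t_0}^{t} \delta_i^2(\tau)\, d\tau$ using a least squares update law with a forgetting factor as [25], [26]

$$\dot{\hat{W}}_{ci} = -\eta_{ci} \Gamma_i \frac{\omega_i}{1 + \nu_i \omega_i^T \Gamma_i \omega_i} \delta_i, \tag{15}$$

$$\dot{\Gamma}_i = -\eta_{ci} \Big( -\lambda_i \Gamma_i + \Gamma_i \frac{\omega_i \omega_i^T}{1 + \nu_i \omega_i^T \Gamma_i \omega_i} \Gamma_i \Big), \tag{16}$$

where $\nu_i, \eta_{ci} \in \mathbb{R}$ are positive adaptation gains and $\lambda_i \in (0, 1)$ is the forgetting factor for the estimation gain matrix $\Gamma_i(\cdot) \in \mathbb{R}^{M_i \times M_i}$. The policy weights are updated to follow the value function weight estimates as

$$\dot{\hat{W}}_{ai} = \mathrm{proj}\big\{ -\eta_{a2i} \big( \hat{W}_{ai} - \hat{W}_{ci} \big) \big\}, \tag{17}$$

where $\eta_{a2i} \in \mathbb{R}$ is a positive adaptation gain, and $\mathrm{proj}\{\cdot\}$ is a smooth projection operator [21]. The use of the forgetting factor ensures that

$$\underline{\varphi}_i I_{M_i} \le \Gamma_i(t) \le \bar{\varphi}_i I_{M_i}, \quad \forall t \in [t_0, \infty), \tag{18}$$

where $\bar{\varphi}_i, \underline{\varphi}_i \in \mathbb{R}$ are constants such that $0 < \underline{\varphi}_i < \bar{\varphi}_i$ [25], [26]. Using (12)-(15), an unmeasurable form of the BE can be written as

$$\begin{aligned}
\delta_i = {} & -\tilde{W}_{ci}^T \omega_i + \frac{1}{4} \tilde{W}_{ai}^T G_{\sigma i} \tilde{W}_{ai} - \frac{1}{2} W_i^T G_{\sigma i} \tilde{W}_{ai} - \frac{1}{2} W_i^T G_{\sigma\epsilon i} - \frac{1}{4} G_{\epsilon i} \\
& - W_i^T \sum_{j \in \{i\} \cup \mathcal{N}_i} \sigma'_{ie_j} T_j(\tilde{F}) + \frac{1}{2} W_i^T \sum_{j \in \{i\} \cup \mathcal{N}_i} \sigma'_{ie_j} T_j\big(G L_\sigma \tilde{W}_a + G L_\epsilon\big) - \sum_{j \in \{i\} \cup \mathcal{N}_i} \epsilon'_{ie_j} T_j(F^*).
\end{aligned} \tag{19}$$

The weight estimation errors for the value function and the policy are defined as $\tilde{W}_{ci}(t) \triangleq W_i - \hat{W}_{ci}(t)$ and $\tilde{W}_{ai}(t) \triangleq W_i - \hat{W}_{ai}(t)$, respectively. Using (15) and (19), the weight estimation error dynamics for the value function can be rewritten as

$$\begin{aligned}
\dot{\tilde{W}}_{ci} = {} & -\eta_{ci} \Gamma_i \psi_i \psi_i^T \tilde{W}_{ci} + \frac{\eta_{ci} \Gamma_i \omega_i}{1 + \nu_i \omega_i^T \Gamma_i \omega_i} \Big( -W_i^T \sum_{j \in \{i\} \cup \mathcal{N}_i} \sigma'_{ie_j} T_j(\tilde{F}) - \frac{1}{4} G_{\epsilon i} - \sum_{j \in \{i\} \cup \mathcal{N}_i} \epsilon'_{ie_j} T_j(F^*) \\
& + \frac{1}{2} W_i^T \sum_{j \in \{i\} \cup \mathcal{N}_i} \sigma'_{ie_j} T_j\big(G L_\sigma \tilde{W}_a + G L_\epsilon\big) + \frac{1}{4} \tilde{W}_{ai}^T G_{\sigma i} \tilde{W}_{ai} - \frac{1}{2} W_i^T G_{\sigma i} \tilde{W}_{ai} - \frac{1}{2} W_i^T G_{\sigma\epsilon i} \Big),
\end{aligned} \tag{20}$$

where $\sigma'_{ie_j} \triangleq \frac{\partial \sigma_i}{\partial e_j}$, $\epsilon'_{ie_j} \triangleq \frac{\partial \epsilon_i}{\partial e_j}$, $G_i \triangleq g_i R_i^{-1} g_i^T$, $G_{\sigma i} \triangleq L_{\sigma i}^T g_i R_i^{-1} g_i^T L_{\sigma i}$, $G_{\epsilon i} \triangleq L_{\epsilon i}^T g_i R_i^{-1} g_i^T L_{\epsilon i}$, $G_{\sigma\epsilon i} \triangleq L_{\sigma i}^T g_i R_i^{-1} g_i^T L_{\epsilon i}$, the argument $G L_\sigma \tilde{W}_a + G L_\epsilon$ of $T_j$ denotes the collection whose $k$-th element is $G_k (L_{\sigma k} \tilde{W}_{ak} + L_{\epsilon k})$, and $\psi_i(\cdot) \triangleq \frac{\omega_i}{\sqrt{1 + \nu_i \omega_i^T \Gamma_i \omega_i}} \in \mathbb{R}^{M_i}$ is the regressor vector. Based on (18), the regressor vector can be bounded as

$$\|\psi_i(t)\| \le \frac{1}{\sqrt{\nu_i \underline{\varphi}_i}}, \quad \forall t \in [t_0, \infty). \tag{21}$$

The dynamics in (20) can be regarded as a perturbed form of the nominal system

$$\dot{\tilde{W}}_{ci} = -\eta_{ci} \Gamma_i \psi_i \psi_i^T \tilde{W}_{ci}. \tag{22}$$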
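The behavior of the least squares update in (15)-(16) on the nominal system (22) can be illustrated with a forward-Euler simulation. This is a minimal sketch under stated assumptions: the BE is replaced by the ideal linear error $\delta = -\tilde{W}^T \omega$ (no perturbation terms), the regressor is a hypothetical persistently exciting signal $[\sin t, \cos t]^T$, and the gains are illustrative values rather than those of Table I.

```python
import numpy as np

def ls_step(W_hat, Gamma, omega, delta, eta_c, nu, lam, dt):
    """One forward-Euler step of the update laws (15) and (16)."""
    denom = 1.0 + nu * (omega @ Gamma @ omega)
    W_dot = -eta_c * (Gamma @ omega) / denom * delta
    Gamma_dot = -eta_c * (-lam * Gamma
                          + Gamma @ np.outer(omega, omega) @ Gamma / denom)
    return W_hat + dt * W_dot, Gamma + dt * Gamma_dot

# Illustrative setup (all values are assumptions).
W_true = np.array([1.0, -2.0])     # plays the role of the ideal weights
W_hat = np.zeros(2)
Gamma = np.eye(2)
eta_c, nu, lam, dt = 2.0, 0.1, 0.5, 1e-3
initial_error = np.linalg.norm(W_true - W_hat)

for k in range(20000):             # 20 seconds of simulated time
    t = k * dt
    omega = np.array([np.sin(t), np.cos(t)])   # persistently exciting regressor
    delta = (W_hat - W_true) @ omega           # equals -W_tilde^T omega
    W_hat, Gamma = ls_step(W_hat, Gamma, omega, delta, eta_c, nu, lam, dt)

final_error = np.linalg.norm(W_true - W_hat)
```

With a persistently exciting regressor the gain matrix stays bounded and positive definite, as in (18), and the weight error decays; if the regressor were held constant, the forgetting term would instead let `Gamma` grow along unexcited directions.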



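Using Corollary 4.3.2 in [26] and Assumption 1, (22) is globally exponentially stable if the regressor vector $\psi_i : [0, \infty) \to \mathbb{R}^{M_i}$ is persistently exciting. Given (18), (21), and (22), Theorem 4.14 in [27] can be used to show that there exists a function $V_{ci} : \mathbb{R}^{M_i} \times [0, \infty) \to \mathbb{R}$ and positive constants $\underline{v}_{ci}$, $\bar{v}_{ci}$, $v_{c1i}$, and $v_{c2i}$ such that for all $t \in [t_0, \infty)$,

$$\underline{v}_{ci} \|\tilde{W}_{ci}\|^2 \le V_{ci}(\tilde{W}_{ci}, t) \le \bar{v}_{ci} \|\tilde{W}_{ci}\|^2, \tag{23}$$

$$\frac{\partial V_{ci}}{\partial t} + \frac{\partial V_{ci}}{\partial \tilde{W}_{ci}} \big( -\eta_{ci} \Gamma_i \psi_i \psi_i^T \tilde{W}_{ci} \big) \le -v_{c1i} \|\tilde{W}_{ci}\|^2, \tag{24}$$

$$\Big\| \frac{\partial V_{ci}}{\partial \tilde{W}_{ci}} \Big\| \le v_{c2i} \|\tilde{W}_{ci}\|. \tag{25}$$

Using Assumptions 1 and 2, the results of Section IV-A, and the fact that $\hat{W}_{ai}$ is bounded by projection, the following bounds are developed to aid the subsequent stability analysis:

$$\begin{aligned}
\Big\| \Gamma_i \frac{\omega_i}{1 + \nu_i \omega_i^T \Gamma_i \omega_i} \Delta_i \Big\| &\le \iota_1, &
\Big\| \frac{1}{2} \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial V_i^*}{\partial e_j} T_j\big(G L_\sigma \tilde{W}_a\big) \Big\| &\le \iota_2, \\
\Big\| \frac{1}{2} \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial V_i^*}{\partial e_j} T_j\big(G L_\epsilon\big) \Big\| &\le \iota_3, &
\|\tilde{W}_{ai}\| &\le \iota_4,
\end{aligned} \tag{26}$$

where $\iota_1, \iota_2, \iota_3, \iota_4 \in \mathbb{R}$ are computable positive constants and $\Delta_i$ denotes the term in parentheses on the right-hand side of (20).

V. Stability Analysis

Theorem 1. Provided Assumptions 1 and 2 hold, and the regressor vector $\psi_i : [0, \infty) \to \mathbb{R}^{M_i}$ is persistently exciting, the controller in (11) and the update laws in (15)-(17) guarantee that the local neighborhood tracking errors for agent $\beta_i$ and its neighbors are UUB. Furthermore, the policy and the value function weight estimation errors for agent $\beta_i$ are UUB, resulting in UUB convergence of the policy $u_i$ to the optimal policy $u_i^*$.

Proof: Consider the function $V_{Li} : S^{|\mathcal{N}_i|+1} \times \mathbb{R}^{2M_i} \times [0, \infty) \to \mathbb{R}$ defined as

$$V_{Li} \triangleq V_i^* + V_{ci} + \frac{1}{2} \tilde{W}_{ai}^T \tilde{W}_{ai},$$

where $V_i^*$ is defined in (6) and $V_{ci}$ is introduced in (23). Using the fact that $V_i^*$ is positive definite, Lemma 4.3 from [27] and (23) yield

$$\underline{v}_{li}(\|Z_i\|) \le V_{Li}(Z_i, t) \le \bar{v}_{li}(\|Z_i\|) \tag{27}$$

for all $Z_i \in B_{b_i}$ and for all $t \in [t_0, \infty)$, where $Z_i \triangleq [\mathcal{E}_i^T\ \tilde{W}_{ci}^T\ \tilde{W}_{ai}^T]^T \in \mathcal{Z} \subset S^{|\mathcal{N}_i|+1} \times \mathbb{R}^{2M_i}$, $\underline{v}_{li} : [0, b_i] \to [0, \infty)$ and $\bar{v}_{li} : [0, b_i] \to [0, \infty)$ are class $\mathcal{K}$ functions, and $B_{b_i} \subset \mathcal{Z}$ denotes a ball of radius $b_i \in \mathbb{R}_+$ around the origin. The time derivative of $V_{Li}$ is

$$\dot{V}_{Li} = \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial V_i^*}{\partial e_j} T_j(F) + \frac{\partial V_{ci}}{\partial t} + \frac{\partial V_{ci}}{\partial \tilde{W}_{ci}} \dot{\tilde{W}}_{ci} - \tilde{W}_{ai}^T \dot{\hat{W}}_{ai}.$$

Using (20), (10), (11), and the fact that from (7), $\sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial V_i^*}{\partial e_j} T_j(F^*) = -r_i^*$, where $r_i^* \triangleq r_i(\mathcal{E}_i, u_i^*)$, yields

$$\begin{aligned}
\dot{V}_{Li} = {} & -r_i^* + \frac{1}{2} \sum_{j \in \{i\} \cup \mathcal{N}_i} \frac{\partial V_i^*}{\partial e_j} T_j\big(G L_\sigma \tilde{W}_a + G L_\epsilon\big) + \frac{\partial V_{ci}}{\partial t} \\
& + \frac{\partial V_{ci}}{\partial \tilde{W}_{ci}} \Big( -\eta_{ci} \Gamma_i \psi_i \psi_i^T \tilde{W}_{ci} + \frac{\eta_{ci} \Gamma_i \omega_i}{1 + \nu_i \omega_i^T \Gamma_i \omega_i} \Delta_i \Big) - \tilde{W}_{ai}^T \dot{\hat{W}}_{ai}.
\end{aligned} \tag{28}$$

Using the bounds in (24)-(26), the update law in (17), and the properties of the projection operator, the Lyapunov derivative in (28) can be upper-bounded as

$$\dot{V}_{Li} \le -\underline{Q}_{ii} \|e_i\|^2 - \sum_{j \in \mathcal{N}_i} a_{ij} \underline{Q}_{ij} \|e_j\|^2 - v_{c1i} \|\tilde{W}_{ci}\|^2 - \eta_{a2i} \|\tilde{W}_{ai}\|^2 + \big( \eta_{ci} v_{c2i} \iota_1 + \eta_{a2i} \iota_4 \big) \|\tilde{W}_{ci}\| + \iota_2 + \iota_3, \tag{29}$$

where $\underline{Q}_{ii}$ and $\underline{Q}_{ij}$ are the minimum eigenvalues of the matrices $Q_{ii}$ and $Q_{ij}$, respectively. Lemma 4.3 in [27] along with completion of the squares on $\|\tilde{W}_{ci}\|$ in (29) yields

$$\dot{V}_{Li}(Z_i, t) \le -v_{li}(\|Z_i\|), \quad \forall \|Z_i\| \ge \iota_{5i} > 0,\ \forall t \in [0, \infty), \tag{30}$$

where $\iota_{5i}$ is a positive constant that depends on $\frac{(\eta_{ci} v_{c2i} \iota_1 + \eta_{a2i} \iota_4)^2}{2 v_{c1i}} + \iota_2 + \iota_3$, and $v_{li} : [0, b_i] \to [0, \infty)$ is a class $\mathcal{K}$ function. Using (27), (30), and Theorem 4.18 in [27], $Z_i(t)$ is UUB. ■

The conclusion of Theorem 1 is that the local neighborhood tracking errors for agent $\beta_i$ and its neighbors are UUB. Since the choice of agent $\beta_i$ is arbitrary, similar analysis on each agent shows that the local neighborhood tracking errors for all the agents are UUB. Hence $\mathcal{E}(t)$ is UUB. Provided that the graph has a spanning tree and at least one of the pinning gains $a_{i0}$ is nonzero, it can be shown that [20], [28]

$$\|X\| \le \frac{\|\mathcal{E}\|}{\underline{s}}, \tag{31}$$

where $\underline{s}$ is the minimum singular value of the matrix $\mathcal{L} + A_0$. Thus, Theorem 1 along with (31) shows that the states $x_i \mid i = 1, \ldots, N$ are UUB around the origin. Based on (29), the ultimate bound can be made smaller by increasing the state penalties $Q_{ii}$ and $Q_{ij}$, and by increasing the number of neurons in the NN approximation of the value function to reduce the approximation errors $\epsilon_i$.

Figure 1. Communication topology and initial conditions.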
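The bound in (31) follows from the relation between the stacked errors and the stacked states, and it can be checked numerically. The sketch below assumes a hypothetical three-agent chain graph with a single unit pinning gain and random states; it is an illustration of the inequality, not part of the paper's simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain graph (assumption): leader -> agent 1 -> agent 2 -> agent 3.
A = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
A0 = np.diag([1.0, 0.0, 0.0])
L = np.diag(A.sum(axis=1)) - A
n = 2

# Minimum singular value of L + A0; it is positive when the graph has a
# spanning tree rooted at the leader.
s_min = np.linalg.svd(L + A0, compute_uv=False).min()

X = rng.standard_normal(3 * n)                 # stacked states
E = np.kron(L + A0, np.eye(n)) @ X             # stacked tracking errors
bound_holds = np.linalg.norm(X) <= np.linalg.norm(E) / s_min + 1e-9
```

Because the Kronecker product with the identity leaves the singular values unchanged, the minimum singular value of $\mathcal{L} + A_0$ is the only graph quantity needed to convert an error bound into a state bound.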

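Remark 1. If $\|Z_i(0)\| \ge \iota_{5i}$ then $\dot{V}_{Li}(Z_i(0), 0) < 0$. Thus, $V_{Li}(Z_i(t), t)$ is decreasing at $t = 0$. Thus, $Z_i(t) \in \mathcal{L}_\infty$, and hence, $\mathcal{E}_i(t) \in \mathcal{L}_\infty$ at $t = 0^+$. Thus all the conditions of Theorem 1 are satisfied at $t = 0^+$. As a result, $V_{Li}(Z_i(t), t)$ is decreasing at $t = 0^+$. By induction, $\|Z_i(0)\| \ge \iota_{5i} \implies V_{Li}(Z_i(t), t) \le V_{Li}(Z_i(0), 0), \forall t \in \mathbb{R}_+$. Thus, from (27), $\|\mathcal{E}_i(t)\| \le \|Z_i(t)\| \le \underline{v}_{li}^{-1}\big(\bar{v}_{li}(\|Z_i(0)\|)\big)$. If $\|Z_i(0)\| < \iota_{5i}$ then (27) and (30) can be used to determine that $\underline{v}_{li}(\|Z_i(t)\|) \le V_{Li}(Z_i(t), t) \le \bar{v}_{li}(\iota_{5i}), \forall t \in \mathbb{R}_+$. As a result, $\|Z_i(t)\| \le \underline{v}_{li}^{-1}\big(\bar{v}_{li}(\iota_{5i})\big)$. Let $\bar{S} \in \mathbb{R}$ be defined as

$$\bar{S} \triangleq \frac{1}{\underline{s}} \sum_{i=1}^{N} \underline{v}_{li}^{-1}\Big(\bar{v}_{li}\big(\max(\|Z_i(0)\|, \iota_{5i})\big)\Big).$$

This relieves Assumption 1 in the sense that the compact set $S \subset \mathbb{R}^n$ that contains the system trajectories $x_i(t), \forall i = 1, \ldots, N, \forall t \in \mathbb{R}_+$ is given by $S = \{x \in \mathbb{R}^n \mid \|x\| \le \bar{S}\}$.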

VI. Simulations

This section demonstrates the applicability of the developed technique. Consider the communication topology of five agents with unit pinning gains and edge weights with the initial configuration as shown in Figure 1. The dynamics of all the agents are chosen as [29]

$$\dot{x}_i = \begin{bmatrix} -x_{i1} + x_{i2} \\ -0.5 x_{i1} - 0.5 x_{i2}\big(1 - (\cos(2 x_{i1}) + 2)^2\big) \end{bmatrix} + \begin{bmatrix} 0 \\ \cos(2 x_{i1}) + 2 \end{bmatrix} u_i,$$

where $x_i(t) = [x_{i1}(t), x_{i2}(t)]^T \in \mathbb{R}^2$ is the state and $u_i(t) \in \mathbb{R}$ is the control input. Table I summarizes the optimal control problem parameters, basis functions, and adaptation gains for the agents. In Table I, $e_{ij}$ denotes the $j$-th element of the vector $e_i$. The value function and the policy weights are initialized equal to one, and the identifier weights are initialized as uniformly distributed random numbers in the interval $[-1, 1]$. All the identifiers have five neurons in the hidden layer, and the identifier gains are chosen as

$$\Gamma_{wfi} = 0.1 \times I, \quad k_{fi} = 600, \quad \alpha_{fi} = 300, \quad \gamma_{fi} = 5, \quad \beta_{1fi} = 0.2.$$

Table I
Simulation parameters.

Agent    Cost parameters                                        Adaptation gains
1        $Q_{11} = I_2$, $R_1 = 1$, $Q_{13} = 0.5 \times I_2$   $\eta_{a1} = 0.1$, $\eta_{c1} = 20$, $\nu_1 = 0.0005$
2        $Q_{22} = I_2$, $R_2 = 1$                              $\eta_{a2} = 10$, $\eta_{c2} = 20$, $\nu_2 = 0.005$
3        $Q_{33} = I_2$, $R_3 = 1$, $Q_{31} = 0.5 \times I_2$   $\eta_{a3} = 0.1$, $\eta_{c3} = 20$, $\nu_3 = 0.005$
4        $Q_{44} = I_2$, $R_4 = 1$, $Q_{41} = 0.1 \times I_2$   $\eta_{a4} = 0.1$, $\eta_{c4} = 20$, $\nu_4 = 0.005$
5        $Q_{55} = I_2$, $R_5 = 1$, $Q_{52} = 0.1 \times I_2$   $\eta_{a5} = 0.1$, $\eta_{c5} = 10$, $\nu_5 = 0.0005$

Basis functions:
$\sigma_1(\mathcal{E}_1) = [e_{11}^2, e_{12}^2, e_{11} e_{12}, e_{31}^2, e_{32}^2, e_{31} e_{32}]^T$
$\sigma_3(\mathcal{E}_3) = [\ldots]$
$\sigma_4(\mathcal{E}_4) = [\ldots]$
$\sigma_5(\mathcal{E}_5) = [\ldots, e_{21} e_{22}, \ldots, e_{51} e_{52}, e_{21} e_{51}, \ldots]$

Figure 2. State trajectories for the first and the second state variable.

Figure 3. Control trajectories.

Figure 4. Value function weights.

An exponentially decreasing probing signal is added to the 
controllers to ensure PE. Figures 2 and 3 show the state 
and the control trajectories for all the agents demonstrating 
consensus to the origin. Note that agents 3, 4, and 5 do not 
have a communication link to the leader. In other words, 
agents 3, 4, and 5 do not know that they have to converge 
to the origin. The convergence is achieved via decentralized 
cooperative control. Figure 4 shows the evolution of the value 
function weights for the agents. Note that convergence of the 
weights is achieved. Figures 2-4 demonstrate the applicability 
of the developed method to cooperatively control a system 
of agents with partially unknown nonlinear dynamics on a 
communication topology. Two-hop local communication is 
needed to implement the developed method. Note that since 
the true weights are unknown, this simulation does not gauge 
the optimality of the developed controller. To gauge the 
optimality, a sufficiently accurate solution to the optimization 
problem will be sought via numerical optimization methods. 
The numerical solution will then be compared against the 
solution obtained using the proposed method. 

VII. Conclusion 

This result combines graph theory with 
the ACI architecture in ADP to synthesize approximate online 
optimal control policies for agents on a communication net- 
work with a spanning tree. NNs are used to approximate the 
policy, the value function, and the system dynamics. UUB 
convergence of the agent states and the weight estimation 
errors is proved through a Lyapunov-based stability analysis. 
Simulations are presented to demonstrate the applicability of 
the proposed technique to cooperatively control a group of five 
agents. Like other ADP-based results, this result hinges on the 
system states being PE. Furthermore, possible obstacles and 
possible collisions are ignored in this work. Future efforts will 
focus on resolving these limitations. 